{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#index进阶操作\n",
    "下面介绍的方法只支持部分Index类型。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##从index中恢复出原始数据\n",
    "给定id，可以使用reconstruct或者reconstruct_n方法从index中回复出原始向量。  \n",
    "支持IndexFlat, IndexIVFFlat (需要与make_direct_map结合), IndexIVFPQ, IndexPreTransform这几类索引类型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入faiss\n",
    "import sys\n",
    "import numpy as np \n",
    "sys.path.append('/home/maliqi/faiss/python/')\n",
    "import faiss\n",
    "\n",
    "#生成数据\n",
    "d = 16\n",
    "n_data = 500\n",
    "data = np.random.rand(n_data, d).astype('float32')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.58085376 0.5048806  0.99052334 0.5899147  0.5211166  0.35997516\n",
      " 0.7275415  0.1242122  0.08336558 0.48458952 0.3289773  0.905333\n",
      " 0.6513156  0.33422878 0.04078896 0.6842935 ]\n",
      "(10, 16)\n"
     ]
    }
   ],
   "source": [
    "index = faiss.IndexFlatL2(d)\n",
    "index.add(data)\n",
    "re_data = index.reconstruct(0)  #指定需要恢复的向量的id,每次只能恢复一个向量\n",
    "print(re_data)\n",
    "re_data_n = index.reconstruct_n(0, 10) #从第0个向量开始，连续取10个\n",
    "print(re_data_n.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##从index中移除向量\n",
    "使用remove_ids方法可以移除Index中的部分向量，调用了IDSelector对象（或IDSelectorBatch批量操作）标识每个向量是否应该被移除。因为要遍历标识数据库中的每一个向量，所以只有在需要移除大部分向量时才建议使用。   \n",
    "支持IndexFlat, IndexIVFFlat, IndexIVFPQ, IDMap。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "500\n",
      "495\n"
     ]
    }
   ],
   "source": [
    "index = faiss.IndexFlatL2(d)\n",
    "index.add(data)\n",
    "print(index.ntotal)\n",
    "index.remove_ids(np.arange(5)) # 需要移除的向量的id\n",
    "print(index.ntotal)  #移除了5个向量，还剩495个"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##搜索距离范围内的向量\n",
    "以查询向量为中心，返回距离在一定范围内的结果，如返回数据库中与查询向量距离小于0.3的结果。  \n",
    "支持IndexFlat, IndexIVFFlat，只支持在CPU使用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(array([0, 8], dtype=uint64), array([0.        , 1.165087  , 0.92170537, 0.9101888 , 1.2231735 ,\n",
      "       1.2296542 , 1.2302384 , 1.1056653 ], dtype=float32), array([ 49, 135, 150, 225, 266, 323, 484, 491]))\n",
      "(array([ 0, 26], dtype=uint64), array([1.2187614 , 0.        , 1.2426732 , 0.82170576, 1.1128769 ,\n",
      "       0.8076687 , 1.2431146 , 0.9778436 , 1.2443304 , 1.1967008 ,\n",
      "       1.1036559 , 1.1283486 , 1.1076214 , 1.2520782 , 1.2406417 ,\n",
      "       1.2235129 , 1.0338147 , 1.1743065 , 0.9288659 , 1.1673778 ,\n",
      "       1.1726046 , 1.1790745 , 1.1337838 , 1.1365123 , 1.2428    ,\n",
      "       1.0492276 ], dtype=float32), array([  6,   9,  11,  15,  41,  47,  50,  58,  75, 104, 108, 112, 122,\n",
      "       135, 162, 169, 213, 236, 271, 290, 342, 434, 463, 467, 477, 479]))\n"
     ]
    }
   ],
   "source": [
    "index = faiss.IndexFlatL2(d)\n",
    "index.add(data)\n",
    "dist = float(np.linalg.norm(data[3] - data[0])) * 0.99  # 定义一个半径/阈值\n",
    "res_index = index.range_search(data[[49], :], dist)  #用第50个向量查询\n",
    "print(res_index) #返回结果是一个三元组，分别是limit(返回的结果的数量), distance, index\n",
    "res_index = index.range_search(data[[9], :], dist)  #用第10个向量查询\n",
    "print(res_index) #返回结果是一个三元组，分别是limit(返回的结果的数量), distance, index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##拆分/合并index\n",
    "可以将多个index合并，需要注意的是，多个Index的数据应该满足同一分布，并且用同一分布的数据训练index，如果多个Index的数据分布不同，合并时并不会报错，但在理论上会降低索引的精度，应该用与合并后的数据集同分布的训练集再次训练。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "500\n",
      "[[  0  28 382 194 286 114 308 480 254 279]\n",
      " [  1 416 272 250 296 138 366 281  93 169]\n",
      " [  2  44 491 231 178 285 117 273  83 187]\n",
      " [  3 194  28 143 270 430 264 382 197 279]\n",
      " [  4 464 317  89 325 498  83 101 285  51]]\n"
     ]
    }
   ],
   "source": [
    "nlist = 10\n",
    "quantizer = faiss.IndexFlatL2(d)\n",
    "index1 = faiss.IndexIVFFlat(quantizer, d, nlist)\n",
    "index1.train(data)\n",
    "index1.add(data[:250])\n",
    "index2 = faiss.IndexIVFFlat(quantizer, d, nlist)\n",
    "index2.add(data[250:])\n",
    "index1.merge_from(index2, 250)\n",
    "print(index1.ntotal) # 合并后应该包含500个向量\n",
    "dis, ind = index1.search(data[:5], 10)\n",
    "print(ind)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2.0
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}